1 Abstract

In the early 2000s, the New York City Police Department was required to publicly report data on their stop-and-frisk program. This program allows officers to temporarily stop, question, and potentially frisk/search civilians they suspect may have committed a crime without needing evidence or a warrant. The stops may result in release, arrest, or criminal summons. As civilians can be stopped without evidence, we want to explore what variables are the most accurate prediction of the stop resulting in arrest. In order to predict arrests, we will use classification methods. WRITE MORE ABOUT METHODS, FINDINGS, and CONCLUSIONS

The goal of our project is to predict Y. edit, include key findings

2 Introduction

For over 20 years, civilians in NYC may be stopped, question, and searched by the police based solely on suspicion without any evidence. From 2002 – 2012, under Mayor Bloomberg’s administration, stops were more rampant, reaching a peak in 2011 with 685,724 reported stops (NYCLU 2024). With changes to administrations, the number of stops has decreased in recent years with 15,102 and 16,791 stops in 2022 and 2023 respectively. The stop-and-frisk program has been highly criticized as it does not require evidence to stop individuals and may allow police to use racial profiling tactics to target individuals. For example, the American Civil Liberties Union of New York (NYCLU) has criticized the program as their investigation of the data found a disproportionate number of civilians stopped are people of color, in particular Black civilians, and some stops have resulted in police misconduct. As the stops do not require previous evidence or warrants, we are interested in predicting which stops result in arrest, particularly focusing on factors gathered before the stop begins that may predict arrest, such as demographic characteristics of civilians, location, time of day, and attributes of the police officers.

  • Target variable: Our goal is to predict arrested_flag, which is a binary indicator which equals 1 if the individual who was stopped was arrested, and 0 otherwise.1

  • Motivation: edit here re relevance, cite some literature

  • Relevant features: edit

  • Methods: edit

3 Data Description

We use the New York Police Department Stop, Question and Frisk Data from 2023.

This is a stop-level dataset; each observation (row) corresponds to a unique stop made by an NYPD police officer in 2023 as part of the SQF programme.

  • How data is collected & what universe is/is not observed (some note here on the forms they have to fill out to record the stop, refer Precinct and Prejudice paper on why some stops may not actually be recorded)

  • Temporal span: the entire year of 2023.

  • Spatial span: the 77 NYPD police precincts covering all 5 boroughs of NYC.

  • Dimension: The data has 16971 observations and 82 features, as shown below.

We begin the project by installing and loading in the necessary libraries.

# Install and load required packages
if (!require("pacman")) install.packages("pacman")
## Loading required package: pacman
library(pacman)

p_load(readxl, dplyr, ggplot2, knitr, lubridate, tidyr, sf, httr, caret, glmnet, stringr, remotes, RColorBrewer, viridis, scales, classInt, forcats, pROC, randomForest)

We proceed by setting up the file path and importing the data from our project’s GitHub repository.

# get raw content of the file 
response <- GET("https://raw.githubusercontent.com/rrachelkane/data-science-group-project/main/data/sqf-2023.xlsx")

# retrieve the .xlsx file
if (status_code(response) == 200) {
  # create a temporary file to save the downloaded content
  temp_file <- tempfile(fileext = ".xlsx")
  
  # Write the raw content to the temporary file
  writeBin(content(response, "raw"), temp_file)
  
  # Read the Excel file from the temporary file
  sqf_data <- read_xlsx(temp_file)
  
  # View the first few rows of the data
  head(sqf_data)
} else {
  stop("Failed to download the file.")
}
## # A tibble: 6 × 82
##   STOP_ID STOP_FRISK_DATE STOP_FRISK_TIME YEAR2 MONTH2  DAY2  STOP_WAS_INITIATED
##     <dbl> <chr>           <chr>           <dbl> <chr>   <chr> <chr>             
## 1       1 2023-01-01      00:44:00         2023 January Sund… Based on Radio Run
## 2       2 2023-01-01      00:49:00         2023 January Sund… Based on Self Ini…
## 3       3 2023-01-01      05:31:00         2023 January Sund… Based on Radio Run
## 4       4 2023-01-01      04:59:00         2023 January Sund… Based on Self Ini…
## 5       5 2023-01-01      05:21:00         2023 January Sund… Based on Self Ini…
## 6       6 2023-01-01      18:00:00         2023 January Sund… Based on Radio Run
## # ℹ 75 more variables: RECORD_STATUS_CODE <chr>, ISSUING_OFFICER_RANK <chr>,
## #   ISSUING_OFFICER_COMMAND_CODE <chr>, SUPERVISING_OFFICER_RANK <chr>,
## #   SUPERVISING_OFFICER_COMMAND_CODE <chr>,
## #   SUPERVISING_ACTION_CORRESPONDING_ACTIVITY_LOG_ENTRY_REVIEWED <chr>,
## #   LOCATION_IN_OUT_CODE <chr>, JURISDICTION_CODE <chr>,
## #   JURISDICTION_DESCRIPTION <chr>, OBSERVED_DURATION_MINUTES <dbl>,
## #   SUSPECTED_CRIME_DESCRIPTION <chr>, STOP_DURATION_MINUTES <dbl>, …
# check original dimensions
dim(sqf_data)
## [1] 16971    82
# view head
head(sqf_data)
## # A tibble: 6 × 82
##   STOP_ID STOP_FRISK_DATE STOP_FRISK_TIME YEAR2 MONTH2  DAY2  STOP_WAS_INITIATED
##     <dbl> <chr>           <chr>           <dbl> <chr>   <chr> <chr>             
## 1       1 2023-01-01      00:44:00         2023 January Sund… Based on Radio Run
## 2       2 2023-01-01      00:49:00         2023 January Sund… Based on Self Ini…
## 3       3 2023-01-01      05:31:00         2023 January Sund… Based on Radio Run
## 4       4 2023-01-01      04:59:00         2023 January Sund… Based on Self Ini…
## 5       5 2023-01-01      05:21:00         2023 January Sund… Based on Self Ini…
## 6       6 2023-01-01      18:00:00         2023 January Sund… Based on Radio Run
## # ℹ 75 more variables: RECORD_STATUS_CODE <chr>, ISSUING_OFFICER_RANK <chr>,
## #   ISSUING_OFFICER_COMMAND_CODE <chr>, SUPERVISING_OFFICER_RANK <chr>,
## #   SUPERVISING_OFFICER_COMMAND_CODE <chr>,
## #   SUPERVISING_ACTION_CORRESPONDING_ACTIVITY_LOG_ENTRY_REVIEWED <chr>,
## #   LOCATION_IN_OUT_CODE <chr>, JURISDICTION_CODE <chr>,
## #   JURISDICTION_DESCRIPTION <chr>, OBSERVED_DURATION_MINUTES <dbl>,
## #   SUSPECTED_CRIME_DESCRIPTION <chr>, STOP_DURATION_MINUTES <dbl>, …

4 Data Cleaning

4.1 Column Names

First, we change column names from strictly upper case to strictly lower case, because it’s cuter.

colnames(sqf_data) <- tolower(colnames(sqf_data))

# check
colnames(sqf_data)[1:3]
## [1] "stop_id"         "stop_frisk_date" "stop_frisk_time"

4.2 Relevant Columns

Next, we drop all columns which cannot be used for our prediction question, as they are realized after the outcome of interest, namely arrested_flag, or are irrelevant for other reasons. We drop spatial features other than x and y coordinates of the stop, as this is the most granular spatial information and we use this to map each stop to census tracts to obtain demographic information corresponding to the location of the stop.These dropped spatial features also have high cardinality, which would add many dummies to our model, undermining computational efficiency.

sqf_data <- sqf_data %>% 
  select(- c("stop_frisk_date", "record_status_code", "supervising_action_corresponding_activity_log_entry_reviewed", "stop_location_sector_code", "stop_location_apartment", "stop_location_full_address", "stop_location_patrol_boro_name", "stop_location_street_name", "suspect_other_description", "observed_duration_minutes", "stop_duration_minutes", "summons_issued_flag", "supervising_officer_command_code", "issuing_officer_command_code", "stop_location_precinct", "year2", "suspect_arrest_offense"))

# check new dim
dim(sqf_data)
## [1] 16971    65

4.3 Missing Values

First, we note that there is only 1 column with any instance of an NA value.

na_cols <- colMeans(is.na(sqf_data)) * 100 
na_cols[na_cols > 0]
## demeanor_of_person_stopped 
##                   15.01385

The process generating the missingness of demeanor_of_person_stopped is unclear.

sqf_data[1:5, "demeanor_of_person_stopped"]
## # A tibble: 5 × 1
##   demeanor_of_person_stopped      
##   <chr>                           
## 1 YELLING AND KNOCKING ON THE DOOR
## 2 VILIGANT                        
## 3 UPSET                           
## 4 CALM                            
## 5 EVASIVE

Imputation of this would be difficult, so we drop this column.

# drop 
sqf_data <- sqf_data %>% 
  select(-("demeanor_of_person_stopped"))

# check new dim
dim(sqf_data)
## [1] 16971    64

Additionally, there are many observations in the data with values == (null) across different columns.

# get % of nulls, in columns with at least one null
null_cols <- (colMeans(sqf_data == "(null)") * 100)[colMeans(sqf_data == "(null)") * 100 > 0]

# make df for plot
null_cols_df <- data.frame(Feature = names(null_cols), Percentage = null_cols)

dim(null_cols_df)
## [1] 47  2
# order for plot
null_cols_df$Feature <- factor(null_cols_df$Feature, 
                              levels = null_cols_df$Feature[order(null_cols_df$Percentage, decreasing = FALSE)])

# plot
ggplot(null_cols_df, aes(x = Feature, y = Percentage)) +
  geom_bar(stat = "identity", fill = "lightblue", color = "darkblue") +
  labs(title = "Percentage of (null) Values per Column", 
       x = "Columns", 
       y = "Percentage of (null) Values") +
  coord_flip() +  # Flip coordinates
  theme_minimal()

Note, however, that not all of these (null) observations are equivalent:

  • in some columns - particularly those with lower percentages of (null) values, (null) means the data are genuinely effectively NA, as there are instances of both “Y” and “N” (for binary variable for example), alongside (null).
sqf_data %>% 
  group_by(ask_for_consent_flg) %>% 
  summarise(N = n()) %>% 
  kable()
ask_for_consent_flg N
(null) 779
N 14187
Y 2005
  • whereas in other cases, the null values are actually used as “N”.
print(unique(sqf_data$firearm_flag))
## [1] "(null)" "Y"
sqf_data %>% 
  group_by(weapon_found_flag, firearm_flag) %>% 
  summarise(N = n()) %>% 
  kable()
## `summarise()` has grouped output by 'weapon_found_flag'. You can override using
## the `.groups` argument.
weapon_found_flag firearm_flag N
N (null) 14310
N Y 28
Y (null) 1495
Y Y 1138
# note that for the identifying variables related to cops, "yes" entries are indicated unusually
print(unique(sqf_data$id_card_identifies_officer_flag))
## [1] "(null)" "I"
print(unique(sqf_data$verbal_identifies_officer_flag))
## [1] "(null)" "V"
print(unique(sqf_data$shield_identifies_officer_flag))
## [1] "(null)" "S"

Note here that even though a firearm_flag has a "Y" entry and weapon_found_flag has a "N" entry, this is not necessarily incorrect, as the officer can have identified a firearm without having to carry out a frisk nor a `search.

We deal with these cases of (null) separately:

  • we replace the second type of (null) with "N" values
# initialize empty vector
null_2 <- c()

# loop through columns
# loop through columns
for (col in names(sqf_data)) {
  # Get unique values of the column
  unique_values <- unique(sqf_data[[col]])
  
  # Check if unique values are exactly a subset of "Y", "I", "V", "S", and "(null)"
  if (all(unique_values %in% c("Y", "I", "V", "S", "(null)")) && length(unique_values) == 2) {
    null_2 <- c(null_2, col)  # Add column name to null_2
  }
}

# check n of type 2 nulls
length(null_2)
## [1] 30
# pre-clean check examples
print(unique(sqf_data$firearm_flag))
## [1] "(null)" "Y"
print(unique(sqf_data$id_card_identifies_officer_flag))
## [1] "(null)" "I"
# replace these nulls with Ns
sqf_data <- sqf_data %>%
  mutate(across(all_of(null_2), ~ ifelse(. == "(null)", "N", .)))

# post-clean check examples
print(unique(sqf_data$firearm_flag))
## [1] "N" "Y"
print(unique(sqf_data$id_card_identifies_officer_flag))
## [1] "N" "I"
  • now replace the first type with actual NA values:
# initialize empty vector
null_1 <- c()

# loop through columns
for (col in names(sqf_data)) {
  
  # for columns not in null_2
  if (!(col %in% null_2)) {
    # if "(null)" is present in the column
    if ("(null)" %in% sqf_data[[col]]) {
      null_1 <- c(null_1, col)  # add column name to the vector
    }
  }
}

# check length
length(null_1)
## [1] 17
# pre-clean check
print(unique(sqf_data$ask_for_consent_flg))
## [1] "N"      "Y"      "(null)"
# replace these with NAs
sqf_data <- sqf_data %>%
  mutate(across(all_of(null_1), ~ ifelse(. == "(null)", NA, .)))

# post-clean check
print(unique(sqf_data$ask_for_consent_flg))
## [1] "N" "Y" NA

Now, we percentage of actual missing values, correctly identified by "NA":

# get % of NAs, in columns with at least one NA
na_cols <- (colMeans(is.na(sqf_data)) * 100)[colMeans(is.na(sqf_data)) * 100 > 0]

# make df for plot
na_cols_df <- data.frame(Feature = names(na_cols), Percentage = na_cols)

# order for plot
na_cols_df$Feature <- factor(na_cols_df$Feature, 
                              levels = na_cols_df$Feature[order(na_cols_df$Percentage, decreasing = FALSE)])

# plot
ggplot(na_cols_df, aes(x = Feature, y = Percentage)) +
  geom_bar(stat = "identity", fill = "#F8566D", color = "black") +
  labs(title = "Percentage of NA Values per Column", 
       x = "Columns", 
       y = "Percentage of NA Values") +
  coord_flip() +  # Flip coordinates
  theme_minimal()

Given the above, we

  • drop columns where more than 25% of observations are missing
sqf_data <- sqf_data %>% 
  select(-all_of(names(na_cols[na_cols > 25])))

dim(sqf_data)
## [1] 16971    59
  • drop the remaining observations where there are missing values
sqf_data <- sqf_data %>%
  filter(!if_any(everything(), is.na))

dim(sqf_data)
## [1] 12095    59

4.4 Coding of Binary Variables

We change the coding of binary variables:

# pre check
print(unique(sqf_data$firearm_flag))
## [1] "N" "Y"
print(unique(sqf_data$id_card_identifies_officer_flag))
## [1] "N" "I"
# clean Ys and Ns and set as numeric
sqf_data <- sqf_data %>%
  mutate(across(
    where(~ all(. %in% c("Y", "N", "I", "V", "S"))), 
    ~ as.numeric(ifelse(. %in% c("Y", "I", "V", "S"), 1, 0))
  ))

# post check
print(unique(sqf_data$firearm_flag))
## [1] 0 1
print(unique(sqf_data$id_card_identifies_officer_flag))
## [1] 0 1

4.5 Feature Transformation/Engineering

We also bin the time of the stop from stop_frisk_time:

sqf_data <- sqf_data %>%
  mutate(
    time_of_day = case_when(
      str_extract(stop_frisk_time, "^\\d{2}") %in% c("00", "01", "02", "03", "04", "05") ~ "Late Night",
      str_extract(stop_frisk_time, "^\\d{2}") %in% c("06", "07", "08", "09", "10", "11") ~ "Morning",
      str_extract(stop_frisk_time, "^\\d{2}") %in% c("12", "13", "14", "15", "16", "17") ~ "Afternoon",
      str_extract(stop_frisk_time, "^\\d{2}") %in% c("18", "19", "20", "21", "22", "23") ~ "Evening",
      TRUE ~ NA_character_
    ),
    time_of_day = factor(time_of_day, levels = c("Late Night", "Morning", "Afternoon", "Evening"))
  )

# check
print(table(sqf_data$time_of_day))
## 
## Late Night    Morning  Afternoon    Evening 
##       2807        909       2753       5626
# now drop stop frisk time as we will just use time_of_day
sqf_data <- sqf_data %>%
  select(-"stop_frisk_time")

We also convert other relevant variables to factors as needed:

# convert character columns to factors, except for stop location x and y
sqf_data <- sqf_data %>%
  mutate(across(
    .cols = where(is.character) & !c("stop_location_x", "stop_location_y"),
    .fns = as.factor
  ))

Next, we will make suspect_height, suspect_weight, and suspect_reported_age numeric keeping in mind the factor levels to ensure R converts them correctly. Then, we will address outliers in the data.

# define convert factor to numeric, handling non-numeric entries
convert_to_numeric <- function(x) {
  # convert to character to avoid factor levels issues
  x <- as.character(x)
 
  # replace non-numeric values with NA (e.g., "unknown", "760", "7.6")
  x <- gsub("[^0-9.]", "", x)
 
  # convert to numeric
  as.numeric(x)
}

# apply the function to the relevant columns
sqf_data <- sqf_data %>%
  mutate(
    suspect_reported_age = convert_to_numeric(suspect_reported_age),
    suspect_height = convert_to_numeric(suspect_height),
    suspect_weight = convert_to_numeric(suspect_weight)
  )

# check to make sure successful
summary(sqf_data$suspect_reported_age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   19.00   25.00   27.77   34.00  118.00
summary(sqf_data$suspect_height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.500   5.400   5.700   5.649   5.900   7.600
summary(sqf_data$suspect_weight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   150.0   160.0   166.3   180.0   760.0

Before addressing outliers in the three numerical features, we first need to transform the height variable as it is currently in the format of feet.inches. This creates an issue as an observation of 5.10 means 5 feet 10 inches, but 5.1 means 5 feet 1 inch. We will transform the inches into feet so each observation is only in feet.

It should be noted that in the raw data we have values such as 5.90, which really means 5 feet 9 inches, not 5 feet 90 inches. To address this during the conversion, we checked to see if the inches amount had 1 or 2 decimals and handled accordingly. We also ensured all the inches amounts were in between 0 and 11.9 to ensure none were accidentally converted incorrectly.

# Function to convert feet.inches to feet
convert_to_feet <- function(feet_inches) {
  # Extract feet (integer part)
  feet <- floor(feet_inches)
  
  # Get the fractional part
  fractional_part <- feet_inches - feet
  
  # Interpret inches based on the fractional part
  if (fractional_part == 0) {
    # No fractional part means no additional inches
    inches <- 0
  } else {
    # Convert fractional part to a string to check its length
    fractional_str <- as.character(fractional_part)
    
    if (grepl("\\.\\d$", fractional_str)) {
      # Case like `.1`: Single digit after decimal -> 1, 2, ..., 9 inches
      inches <- fractional_part * 10
    } else if (grepl("\\.\\d0$", fractional_str)) {
      # Case like `.10`, `.20`, etc.: Two digits ending in `0` -> 10, 20, etc. inches
      inches <- fractional_part * 100
    } else {
      # Case like `.11`, `.12`, etc.: Two digits not ending in `0` -> Exact inches
      inches <- fractional_part * 100
    }
  }
  
  # Validate inches (should be between 0 and 11)
  if (inches < 0 || inches > 11.9) {
    warning(paste("Invalid height input:", feet_inches, "- Inches must be between 0 and 11.9. Returning NA."))
    return(NA) # We have no NAs, so inches were extracted correctly. 
  }
  
  # Convert inches to feet
  inches_in_feet <- inches / 12
  
  # Return total height in feet
  return(feet + inches_in_feet)
}

# Apply the conversion function to the 'suspect_height' column
sqf_data$suspect_height_feet <- sapply(sqf_data$suspect_height, convert_to_feet)

# Check the result
tail(sqf_data[, c("suspect_height", "suspect_height_feet")])
## # A tibble: 6 × 2
##   suspect_height suspect_height_feet
##            <dbl>               <dbl>
## 1           5.7                 5.58
## 2           5.11                5.92
## 3           5.9                 5.75
## 4           5.8                 5.67
## 5           6.2                 6.17
## 6           5.8                 5.67
sqf_data$suspect_height <- sqf_data$suspect_height_feet

#drop suspect
sqf_data <- sqf_data %>% dplyr::select(-suspect_height_feet)

We now plot the density of height in feet to see the distribution and identify outliers. We are going to drop observations below 4 ft and above 7 ft.

# compute density for suspect_height
height_density <- density(sqf_data$suspect_height, na.rm = TRUE)

# plot the density
plot(height_density,
     main = "Density of Height with Outliers Highlighted",
     xlab = "Height",
     ylab = "Density",
     col = "black",
     lwd = 2)
grid()

# identify and highlight outliers (e.g., heights below 4 feet and above 7 feet)
outliers <- sqf_data$suspect_height[sqf_data$suspect_height < 4 | sqf_data$suspect_height > 7]

# add vertical lines to mark the outlier boundaries
abline(v = c(4, 7), col = "darkorchid2", lty = 2, lwd = 2)

# highlight the outlier data points on the plot
points(outliers, rep(0, length(outliers)), col = "darkorchid2", pch = 19)

# drop outlier observations where height is above 7 ft and below 4 ft.
sqf_data <- sqf_data[sqf_data$suspect_height >= 4 & sqf_data$suspect_height <= 7, ]

We plot the density of age to see the distribution. Since we have some outliers, we dropped any observations where age is below 10 and above 85. We also noticed the suspect_reported_age data is quiet skewed, so we applied a logarithmic transformation to this variable.

# compute density for reported_age
reported_age_density <- density(sqf_data$suspect_reported_age, na.rm = TRUE)

# plot the density
plot(reported_age_density,
     main = "Density of Reported Age with Outliers Highlighted",
     xlab = "Reported Age",
     ylab = "Density",
     col = "black",
     lwd = 2)
grid()

# identify and highlight outliers (ages < 10 and > 85)
outliers <- sqf_data$suspect_reported_age[sqf_data$suspect_reported_age < 10 | sqf_data$suspect_reported_age > 85]

# add vertical lines to mark the outlier boundaries
abline(v = c(10, 85), col = "deeppink", lty = 2, lwd = 2)

# highlight the outlier data points on the plot
points(outliers, rep(0, length(outliers)), col = "deeppink", pch = 19)

# drop outlier observations where age is above 85 and below 10.
sqf_data <- sqf_data[sqf_data$suspect_reported_age >= 10 & sqf_data$suspect_reported_age <= 85, ]
#Since the `suspect_reported_age` data is quiet skewed, we will perform a logarithmic transformation. 
sqf_data$suspect_reported_age <- log(sqf_data$suspect_reported_age)

We plot the density of weight to identify outliers. The weight variable is measured in pounds and it appears that some observations could have data entry errors (for example: 9, 16, 760). We will drop outliers above 300 lbs and below 90 lbs.

# compute density for suspect_weight
weight_density <- density(sqf_data$suspect_weight, na.rm = TRUE)

# plot the density
plot(weight_density,
     main = "Density of Suspect Weight with Outliers Highlighted",
     xlab = "Weight",
     ylab = "Density",
     col = "black",
     lwd = 2)
grid()

# identify and highlight outliers (e.g., weights below 90 lbs and above 300 lbs)
outliers <- sqf_data$suspect_weight[sqf_data$suspect_weight < 90 | sqf_data$suspect_weight > 300]

# add vertical lines to mark the outlier boundaries
abline(v = c(90, 300), col = "darkturquoise", lty = 2, lwd = 2)

# highlight the outlier data points on the plot
points(outliers, rep(0, length(outliers)), col = "darkturquoise", pch = 19)

# drop outlier observations where height is above 350 lbs. and below 90 lbs.
sqf_data <- sqf_data[sqf_data$suspect_weight >= 90 & sqf_data$suspect_weight <= 300, ]

Finally, we will standardize our numerical variables as they use different scales. This will allow us to better compare coefficients of these variables.

# standardize the numeric variables
sqf_data <- sqf_data %>%
  mutate(
    suspect_reported_age = scale(suspect_reported_age),
    suspect_height = scale(suspect_height),
    suspect_weight = scale(suspect_weight)
  )

# check if the standardization worked by summarizing the variables
summary(sqf_data$suspect_reported_age)
##        V1         
##  Min.   :-2.3918  
##  1st Qu.:-0.7609  
##  Median :-0.0636  
##  Mean   : 0.0000  
##  3rd Qu.: 0.7177  
##  Max.   : 2.8599
summary(sqf_data$suspect_height)
##        V1          
##  Min.   :-5.25715  
##  1st Qu.:-0.56747  
##  Median :-0.07382  
##  Mean   : 0.00000  
##  3rd Qu.: 0.41983  
##  Max.   : 3.62856
summary(sqf_data$suspect_weight)
##        V1         
##  Min.   :-2.4893  
##  1st Qu.:-0.5337  
##  Median :-0.2078  
##  Mean   : 0.0000  
##  3rd Qu.: 0.4440  
##  Max.   : 4.3551

5 Exploratory Data Analysis

5.1 Basic Summary Statistics

# check dim again
dim(sqf_data)
## [1] 12044    59
# tabulation of the dependent variable 
sqf_data %>% 
  group_by(suspect_arrested_flag) %>% 
  summarise(N = n(),
            Pc = N / nrow(sqf_data) * 100) %>% 
  arrange(desc(N)) %>% 
  kable(booktabs = TRUE, col.names = c("Suspect Arrested", "N Stops", "% Total Stops"), align = "l")
Suspect Arrested N Stops % Total Stops
0 8270 68.6649
1 3774 31.3351
# looking at the distribution of sex
ggplot(sqf_data, aes(x = suspect_sex, fill = suspect_sex)) +
  geom_bar() +
  labs(
    title = "Distribution of Suspect Sex", 
    x = "Sex", 
    y = "N Stops"
  ) +
  theme_minimal() +
  scale_fill_manual(
    values = c("MALE" = "lightblue", "FEMALE" = "pink") 
  ) +
  theme(legend.position = "none")

# sex by arrest status
ggplot(sqf_data, aes(x = suspect_sex, fill = factor(suspect_arrested_flag))) +
  geom_bar(position = "fill") +
  labs(title = "Distribution of Arrest Status, By Suspect Sex", 
       x = "Sex", 
       y = "% Stops") +
  theme_minimal() +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_brewer(type = "qual", palette = "Pastel2", name = "Suspect Arrested") 

# empirical cdf of age by sex and arrest status
ggplot(sqf_data, aes(x = suspect_reported_age, color = factor(suspect_arrested_flag))) +
  stat_ecdf(geom = "step") +
  facet_wrap(~ suspect_sex, ncol = 2) +
  scale_color_manual(values = c("0" = "red", "1" = "darkgreen"),
                     labels = c("Not Arrested", "Arrested"),
                     name = "Arrest Outcome") +
  labs(x = "Suspect Reported Age", y = "ECDF", title = "Empirical CDF of Suspect Reported Age, By Sex and Arrest Status") +
  theme_minimal()

# arrests by race, unstacked
ggplot(data = sqf_data, aes(x = fct_rev(fct_infreq(suspect_race_description)), fill = factor(suspect_arrested_flag))) +
  geom_bar() +
  coord_flip() +
  theme_minimal() +
  xlab("Suspect Race") +
  ylab("N Stops") +
  scale_fill_brewer(type = "qual", palette = "Pastel1", name = "Suspect Arrested") +
  labs(title = "Distribution of Arrest Status, By Suspect Race")

# arrests by race, stacked
ggplot(data = sqf_data, aes(x = fct_rev(fct_infreq(suspect_race_description)), fill = factor(suspect_arrested_flag))) +
  geom_bar(position = "fill") +
  coord_flip() +
  scale_y_continuous(labels = scales::percent) +
  theme_minimal() +
  xlab("Suspect Race") +
  ylab("% Stops") +
  scale_fill_brewer(type = "qual", palette = "Pastel1", name = "Suspect Arrested") +
  labs(title = "Distribution of Arrest Status, By Suspect Race")

# arrests by suspected crime description, unstacked
ggplot(data = sqf_data, aes(x = fct_rev(fct_infreq(suspected_crime_description)), fill = factor(suspect_arrested_flag))) +
  geom_bar() +
  coord_flip() +
  theme_minimal() +
  xlab("Suspected Crime") +
  ylab("N Stops") +
  scale_fill_brewer(type = "qual", palette = "Pastel1", name = "Suspect Arrested") +
  labs(title = "Distribution of Arrest Status, By Suspected Crime")

# arrests by suspected crime description, stacked
ggplot(data = sqf_data, aes(x = suspected_crime_description, fill = factor(suspect_arrested_flag))) +
  geom_bar(position = "fill") +
  coord_flip() +
  scale_y_continuous(labels = scales::percent) +
  theme_minimal() +
  xlab("Suspected Crime") +
  ylab("% Stops") +
  scale_fill_brewer(type = "qual", palette = "Pastel1", name = "Suspect Arrested") +
  labs(title = "Distribution of Arrest Status, By Suspect Crime")

# heatmap of arrests by time of day and borough
ggplot(
  sqf_data %>%
    group_by(stop_location_boro_name, time_of_day) %>%
    summarise(
      percent_arrested = mean(suspect_arrested_flag, na.rm = TRUE) * 100,
      .groups = "drop"
    ),
  aes(
    x = time_of_day,
    y = stop_location_boro_name,
    fill = percent_arrested
  )
) +
  geom_tile(color = "white") +
  scale_fill_viridis_c(name = "Arrest Rate (%)") +
  labs(
    title = "Heat Map of Arrest Rates",
    x = "Time of Day",
    y = "Borough"
  ) +
  theme_minimal()

# heatmap of arrests by issuing officer rank and suspect race
ggplot(
  sqf_data %>%
    group_by(issuing_officer_rank, suspect_race_description) %>%
    summarise(
      percent_arrested = mean(suspect_arrested_flag, na.rm = TRUE) * 100,
      .groups = "drop"
    ),
  aes(
    x = issuing_officer_rank,
    y = suspect_race_description,
    fill = percent_arrested
  )
) +
  geom_tile(color = "white") +
  scale_fill_viridis_c(name = "Arrest Rate (%)") +
  labs(
    title = "Heat Map of Arrest Rates",
    x = "Officer in Uniform",
    y = "Suspect Race"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

5.2 Spatial Data Visualisation

We first clean the data for spatial mapping using the sf and nycgeo packages.

We use this to obtain information about the stops at the census tract level, due to its granularity and the availability of population statistics at this level. insert why relevant for prediction

# drop 7 observations which have incorrect spatial info
sqf_data <- sqf_data %>% 
  filter(stop_location_x > 0)

dim(sqf_data)
## [1] 12037    59
# make spatial object for mapping
sqf_data_sf <- st_as_sf(sqf_data, 
                        coords = c("stop_location_x", "stop_location_y"), 
                        crs = 2263)  #  crs for New York (EPSG:2263)

# load in nta-level shapefile
remotes::install_github("mfherman/nycgeo")
## Skipping install of 'nycgeo' from a github remote, the SHA1 (4fee55c1) has not changed since last install.
##   Use `force = TRUE` to force installation
library(nycgeo)
nyc_tract_shp <- nycgeo::nyc_boundaries(geography = "tract", add_acs_data = TRUE)
## old-style crs object detected; please recreate object with a recent sf::st_crs()
# check crs
st_crs(nyc_tract_shp)$epsg
## [1] 2263

5.2.1 Stop-Level Maps

# plot data onto shapefile by arrest status
ggplot() +
  geom_sf(data = nyc_tract_shp, fill = "lightblue", color = "black", size = 0.3) +
  geom_sf(data = sqf_data_sf, aes(color = as.factor(suspect_arrested_flag)), size = 0.7) +
  scale_color_manual(values = c("red", "seagreen"),  
                     labels = c("Arrested", "Not Arrested")) +
  theme_minimal() +
  labs(title = "NYC Police Stops by Arrest Status") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5),
        legend.title = element_blank())

5.2.2 Tract Level Maps

Perhaps it is more informative to view the percentage of stops ending in arrest at the tract level:

# join datasets to assign each stop to a tract
sqf_data_sf <- st_join(sqf_data_sf, nyc_tract_shp)
dim(sqf_data_sf)
## [1] 12037   104
# aggregate to tract level
sqf_data_sf_tract_level <- sqf_data_sf %>%
  filter(!is.na(geoid)) %>%
  group_by(geoid) %>%
  summarize(pc_arrest = (sum(suspect_arrested_flag) / n()) * 100)

# join with shp for mapping
sqf_data_sf_tract_level <- nyc_tract_shp %>%
  st_join(sqf_data_sf_tract_level, by = "geoid")
  
ggplot() +
  geom_sf(data = sqf_data_sf_tract_level, aes(fill = pc_arrest), color = "black", size = 0.3) +
  scale_fill_viridis_c(
    name = "% Stops Ending in Arrest",
    option = "inferno",
    na.value = "white"
  ) +
  theme_void() +
  labs(title = "Percentage of Stops Ending in Arrest by NYC Census Tract")

5.2.3 Tract-Level Population Predictors

Additionally, the nycgeo package allows us to link with census tract level American Community Survey (2013-2017) population statistics.

insert justification for why these - demographic stats of location of the stop - might be predictors of arrest

We visualize those here:

# we first create the new variables in the shapefile for ease of plotting
nyc_tract_shp <- nyc_tract_shp %>%
  mutate(pc_pop_black_est = (pop_black_est / pop_total_est) * 100,
         pc_pop_hisp_est = (pop_hisp_est / pop_total_est) * 100,
         pc_pop_asian_est = (pop_asian_est / pop_total_est) * 100,
         pc_pop_ba_above_est = (pop_ba_above_est / pop_total_est) * 100,
         pc_pop_inpov_est = (pop_inpov_est / pop_total_est) * 100
  )
# non hispanic black
ggplot(nyc_tract_shp) +
  geom_sf(aes(fill = pc_pop_black_est)) +
  scale_fill_viridis_c(
    name = "Non-Hispanic Black % Total Population",
    option = "inferno"
  ) +
  theme_void() +
  labs(title = "Non-Hispanic Black Population by Census Tract, ACS 2013-2017")

# hispanic any
ggplot(nyc_tract_shp) +
  geom_sf(aes(fill = pc_pop_hisp_est)) +
  scale_fill_viridis_c(
    name = "Hispanic Any Race % Total Population",
    option = "inferno"
  ) +
  theme_void() +
  labs(title = "Hispanic Any Race Population by Census Tract, ACS 2013-2017")

# non
ggplot(nyc_tract_shp) +
  geom_sf(aes(fill = pc_pop_asian_est)) +
  scale_fill_viridis_c(
    name = "Non-hispanic Asian % Total Population",
    option = "inferno"
  ) +
  theme_void() +
  labs(title = "Non-hispanic Asian  Population by Census Tract, ACS 2013-2017")

# pop age 25 years or older with at least bachelors degree
ggplot(nyc_tract_shp) +
  geom_sf(aes(fill = pc_pop_ba_above_est)) +
  scale_fill_viridis_c(
    name = "% Population Aged >= 25 with at Least Bachelors Degree",
    option = "inferno"
  ) +
  theme_void() +
  labs(title = "% Population Aged >= 25 with at least a Bachelor's Degree by Census Tract, ACS 2013-2017")

# income below pov
# pop age 25 years or older with at least bachelors degree
ggplot(nyc_tract_shp) +
  geom_sf(aes(fill = pc_pop_inpov_est)) +
  scale_fill_viridis_c(
    name = "% Population With Income Below Poverty Line",
    option = "inferno"
  ) +
  theme_void() +
  labs(title = "% Population with Income Below Poverty Line, ACS 2013-2017")

We keep only these predictors from the ACS data as predictors for our analysis, as shown below.

# now we create the new variables in the sqf_data_sf object, before merging as appropriate by stop_id into the original data
sqf_data_sf <- sqf_data_sf %>%
  mutate(pc_pop_black_est = (pop_black_est / pop_total_est) * 100,
         pc_pop_hisp_est = (pop_hisp_est / pop_total_est) * 100,
         pc_pop_asian_est = (pop_asian_est / pop_total_est) * 100,
         pc_pop_ba_above_est = (pop_ba_above_est / pop_total_est) * 100,
         pc_pop_inpov_est = (pop_inpov_est / pop_total_est) * 100
  )

# check current dim
dim(sqf_data)
## [1] 12037    59
sqf_data <- sqf_data %>%
  # left join selected spatial features from the sf object into sqf_data
  left_join(sqf_data_sf %>% select(stop_id, pc_pop_ba_above_est, pc_pop_inpov_est, pc_pop_asian_est, pc_pop_hisp_est, pc_pop_black_est), by = "stop_id") %>%
  # drop x,y coords and geometry as we use census tract for spatial info
  select(-c("stop_location_x", "stop_location_y", "geometry")) %>%
  # drop obs with missing values in these spatial features
  filter(!if_any(everything(), is.na))
## Warning: Using one column matrices in `filter()` was deprecated in dplyr 1.1.0.
## ℹ Please use one dimensional logical vectors instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# check new dim
dim(sqf_data)
## [1] 11957    62

6 Regression Analysis

For our logistic and penalized regressions, we split our sample in two:

  • training data
  • test data

We perform this below:

# set seed for reproducibility
set.seed(1)

# set y and X matrix
y <- sqf_data$suspect_arrested_flag
X <- model.matrix(~ . - suspect_arrested_flag - stop_id -search_basis_incidental_to_arrest_flag, data = sqf_data)

# perform train-test split
train_index <- createDataPartition(y, p = 0.7, list = FALSE)
y_train <- y[train_index]
y_test <- y[-train_index]
X_train <- X[train_index,]
X_test <- X[-train_index, ]

# print lengths and dimensions
cat("Length of y_train:", length(y_train), "\n")
## Length of y_train: 8370
cat("Length of y_test:", length(y_test), "\n")
## Length of y_test: 3587
cat("Dimensions of X_train:", dim(X_train), "\n")
## Dimensions of X_train: 8370 157
cat("Dimensions of X_test:", dim(X_test), "\n")
## Dimensions of X_test: 3587 157
# check balance of y
print(table(y_train))
## y_train
##    0    1 
## 5752 2618
print(table(y_test))
## y_test
##    0    1 
## 2465 1122

Note that, for each model type, we estimate using two sets of matrices of predictors to test sensitivity of prediction accuracy:

  • the first (X_train and X_test) simply drops stop_id, due to its irrelevance for prediction, and also search_basis_incidental_to_arrest_flag, as this indicates a search was carried out after an arrest. As we are ultimately interested in predicting arrest status ex-ante to the stop from the police officer’s perspective [edit]

  • the second set (X_train_subset and X_test_subset) also drops all further variables related to search, frisk and physical force. We drop these for a conservative approach to prediction, as it is unclear when these occur in relation to the arrest [edit].

We create this subset below:

# set x subset, removing anything that is/might be realized during the stop
X_subset <- X[, !grepl("^(search|physical_force|.*eye_color|.*hair_color)", colnames(X)) &
                 !colnames(X) %in% c("frisked_flag", "firearm_flag", "knife_cutter_flag",
                                     "other_weapon_flag", "weapon_found_flag", "other_contraband_flag")]


# perform train-test split
X_train_subset <- X_subset[train_index, ]
X_test_subset <- X_subset[-train_index, ]

cat("Dimensions of X_train_subset:", dim(X_train_subset), "\n")
## Dimensions of X_train_subset: 8370 114
cat("Dimensions of X_test_subset:", dim(X_test_subset), "\n")
## Dimensions of X_test_subset: 3587 114

Next, note that we also require a validation dataset for our tuning of hyperparameters in our tuned elastic net and random forest models.

For this reason, we split X_train_subset further into:

  • X_train_subset_final
  • X_val_subset

and correspondingly, y_train further into:

  • y_train_final
  • y_val

We only do these for the subset, as we only tune these models for the subset for computational reasons.2

We create these matrices here:

# subset training data for validation split
n_train <- nrow(X_train)
id_split <- sample(1:n_train, floor(0.5 * n_train))

# split into training and validation sets
X_train_subset_final <- X_train_subset[id_split, ]
y_train_final <- y_train[id_split]
X_val_subset <- X_train_subset[-id_split, ]
y_val <- y_train[-id_split]

# print dimensions
cat("dimensions of X_train_final:", dim(X_train_subset_final), "\n")
## dimensions of X_train_final: 4185 114
cat("dimensions of y_train_final:", length(y_train_final), "\n")
## dimensions of y_train_final: 4185
cat("dimensions of X_val:", dim(X_val_subset), "\n")
## dimensions of X_val: 4185 114
cat("dimensions of y_val:", length(y_val), "\n")
## dimensions of y_val: 4185

We now proceed by running our regression models, evaluating each of them based on their predictive performance in the test data.

Note that as we are working in a binary classification setting, we are looking to minimize misclassification error. The cost ratio in this framework is: \[C = \frac{L(1,0)}{L(0,1)}\].

This measures how costly false negatives are relative to false positives. The optimal decision is \[\hat{y_i} = 1 \iff p_i > {1 \over 1+C}\]

When looking at our confusion matrices, we choose \(C=1\) and weight both types of misclassification equally, and so classify according to the most likely class i.e \[p_i > 0.5 \longrightarrow \hat{y} = 1\].

6.1 Logistic Regression

Our logistic regression model will serve as a simple baseline model. As it is a simple model, we will use X_train_subset, as this is a more conservative approach as we drop the variables that may have occurred before or after arrests.

Since logistic regression is sensitive to multicollinearity, we also drop the factor variables supervising_officer_rank as they are closely related to issuing_officer_rank, as it is likely issuing_officers of a specific rank have the same rank of supervisors.

# Set X_logit, remove supervising_officer_rank to avoid rank-deficient fit 
X_logit <- X_subset[, !grepl("^(supervising_officer_rank)", colnames(X_subset))]

Now we set the training and testing split using the X_logit data and run our model.

# perform train-test split on X_logit
X_train_logit <- X_logit[train_index, ]
X_test_logit <- X_logit[-train_index, ]

# print lengths and dimensions
cat("Dimensions of X_train_logit:", dim(X_train_logit), "\n")
## Dimensions of X_train_logit: 8370 101
cat("Dimensions of X_test_subset:", dim(X_test_logit), "\n")
## Dimensions of X_test_subset: 3587 101
# glm requires dataframes as arguments
X_train_logit_df <- as.data.frame(X_train_logit)
X_test_logit_df <- as.data.frame(X_test_logit)

# full logit model in training data
logit <- glm(y_train ~ ., data = X_train_logit_df, family = binomial(logit))

# get fitted probabilities using trained model on test data
logit_predict <- as.numeric(predict(logit, newdata = X_test_logit_df, type = "response"))

# convert probabilities to class predictions using a threshold of 0.5
logit_y_hat <- ifelse(logit_predict > 0.5, 1, 0)

# inspect class balance of predicted classes
table(logit_y_hat)  
## logit_y_hat
##    0    1 
## 2828  759
# define a function to generate and plot confusion matrix
generate_cm <- function(true, predicted, title) {
   # gen confusion matrix as a data frame
   cm <- as.data.frame(table(True = true, Predicted = predicted)) %>%
   group_by(Predicted) %>%
   mutate(Predicted_pct = Freq / sum(Freq))
 
  # print cm
  print(cm)
 
  # plot cm
  plot <- ggplot(data = cm, mapping = aes(x = ordered(True, c(1, 0)), y = Predicted, fill = Predicted_pct)) +
    geom_tile() +
    geom_text(aes(label = round(Predicted_pct, 2)), color = 'white') +
    scale_fill_gradient(low = "blue", high = "red", name = "Rel. Freq.") +
    xlab("True") +
    ylab("Predicted") +
    labs(title = title) +
    theme_minimal()
 
  # print plot
  print(plot)
  
}

# plot confusion matrix for logistic regression with all variables
generate_cm(y_test, logit_y_hat, "Confusion Matrix for Logistic Regression")
## # A tibble: 4 × 4
## # Groups:   Predicted [2]
##   True  Predicted  Freq Predicted_pct
##   <fct> <fct>     <int>         <dbl>
## 1 0     0          2192         0.775
## 2 1     0           636         0.225
## 3 0     1           273         0.360
## 4 1     1           486         0.640

# compute ROC and AUC
logit_roc <- roc(response = y_test, predictor = logit_predict, quiet = TRUE)  
logit_auc <- round(auc(logit_roc), 4)  
cat("AUC for the logit model", logit_auc, "\n")
## AUC for the logit model 0.7751

6.2 LASSO

Next we implement LASSO, where: \[ \beta^{LASSO} = \arg\min_{\beta} \sum_{i=1}^n\Big( y_i - x_i' \beta \Big)^2\] subject to \[ \sum_{j=1}^{p} |\beta_j| < t \]

We implement cross-validation in our training sample as [edit]. We use the default \(k=10\) folds for computational efficiency.

As we are operating in a classification setting, we choose the value of the parameter lambda that maximises the area under the ROC curve, and implement this by selecting type.measure = "class" in our cv.glmnet regressions.

# run lasso on training data to collect coefficients
lasso <- cv.glmnet(x=X_train, y=y_train, alpha = 1, family="binomial", type.measure = "class")

# optimal lambda
lasso_lambda_min <- lasso$lambda.min
cat("Optimal Lambda for Lasso:", lasso_lambda_min, "\n")
## Optimal Lambda for Lasso: 0.0009253621
# plot misclassification error against log lambda
plot(lasso)
abline(v = log(lasso_lambda_min), col = "red", lwd = 2, lty = 2)
title(main = "Cross-Validation Misclassification Error (LASSO - Full)",
      sub = "Optimal Lambda Highlighted in Red",
      xlab = "Log(Lambda)", ylab = "Misclassification Error")

# plot coefficients
plot(lasso$glmnet.fit, xvar = "lambda", label = TRUE)
title(main = "Coefficient Shrinkage Path for LASSO (Full Model)",
      xlab = "Log(Lambda)", ylab = "Coefficients")

# get fitted probabilities, using best lambda
lasso_predict <- as.numeric(predict(lasso, s = lasso_lambda_min, X_test, type = "response"))

# add variable for predicted classes
lasso_y_hat <- ifelse(lasso_predict > 0.5, 1, 0)

# plot confusion matrix for lasso regression out of sample
generate_cm(y_test, lasso_y_hat, "Confusion Matrix for LASSO (Full Model)")
## # A tibble: 4 × 4
## # Groups:   Predicted [2]
##   True  Predicted  Freq Predicted_pct
##   <fct> <fct>     <int>         <dbl>
## 1 0     0          2348        0.910 
## 2 1     0           231        0.0896
## 3 0     1           117        0.116 
## 4 1     1           891        0.884

# compute ROC and AUC
lasso_roc_full <- roc(response = y_test, predictor = lasso_predict, quiet = TRUE)
lasso_auc_full <- round(auc(lasso_roc_full), 4)  
cat("AUC for the LASSO (full) model", lasso_auc_full, "\n")
## AUC for the LASSO (full) model 0.9464

6.2.1 LASSO Sensitivity Analysis - Subset

# run lasso on training data to collect coefficients
lasso_subset <- cv.glmnet(x=X_train_subset, y=y_train, alpha = 1, family="binomial", type.measure = "class")

lasso_lambda_min_subset <- lasso_subset$lambda.min
cat("Optimal Lambda for Lasso (Subset):", lasso_lambda_min_subset, "\n")
## Optimal Lambda for Lasso (Subset): 0.007069799
# plot misclassification error against log lambda
plot(lasso_subset)
abline(v = log(lasso_lambda_min_subset), col = "red", lwd = 2, lty = 2)
title(main = "Cross-Validation Misclassification Error (LASSO - Subset)",
      sub = "Optimal Lambda Highlighted in Red",
      xlab = "Log(Lambda)", ylab = "Misclassification Error")

# plot coefficients
plot(lasso_subset$glmnet.fit, xvar = "lambda", label = TRUE)
title(main = "Coefficient Shrinkage Path for LASSO (Subset)",
      xlab = "Log(Lambda)", ylab = "Coefficients")

# get fitted probabilities
lasso_predict_subset <- as.numeric(predict(lasso_subset, s = lasso_lambda_min_subset, X_test_subset, type = "response"))

# add variable for predicted classes
lasso_y_hat_subset <- ifelse(lasso_predict_subset > 0.5, 1, 0)

# plot confusion matrix for lasso subset regression out of sample
generate_cm(y_test, lasso_y_hat_subset, "Confusion Matrix for LASSO (Subset Model)")
## # A tibble: 4 × 4
## # Groups:   Predicted [2]
##   True  Predicted  Freq Predicted_pct
##   <fct> <fct>     <int>         <dbl>
## 1 0     0          2294         0.753
## 2 1     0           751         0.247
## 3 0     1           171         0.315
## 4 1     1           371         0.685

# compute roc object for subset model
lasso_roc_subset <- roc(response = y_test, predictor = lasso_predict_subset, quiet = TRUE)
lasso_auc_subset <- round(auc(lasso_roc_subset), 4)  
cat("AUC for the LASSO (subset) model", lasso_auc_subset, "\n")
## AUC for the LASSO (subset) model 0.7754

6.3 Ridge

Next, we implement ridge regression: \[ \beta^{Ridge} = \arg\min_{\beta} \sum_{i=1}^n\Big( y_i - x_i' \beta \Big)^2\] subject to \[ \sum_{j=1}^{p} \beta_j^2 < t \].

# run ridge regression on training data to collect coefficients
ridge <- cv.glmnet(x = X_train, y = y_train, alpha = 0, family = "binomial", type.measure = "class")

ridge_lambda_min <- ridge$lambda.min
cat("optimal lambda for ridge:", ridge_lambda_min, "\n")
## optimal lambda for ridge: 0.02457843
# plot misclassification error against log lambda
plot(ridge)
abline(v = log(ridge_lambda_min), col = "red", lwd = 2, lty = 2)
title(main = "Cross-Validation Misclassification Error (Ridge - Full)",
      sub = "Optimal Lambda Highlighted in Red",
      xlab = "Log(Lambda)", ylab = "Misclassification Error")

# plot coefficients
plot(ridge$glmnet.fit, xvar = "lambda", label = TRUE)
title(main = "Coefficient Shrinkage Path for Ridge (Full Model)",
      xlab = "Log(Lambda)", ylab = "Coefficients")

# get fitted probabilities
ridge_predict <- as.numeric(predict(ridge, s = ridge_lambda_min, X_test, type = "response"))

# add variable for predicted classes
ridge_y_hat <- ifelse(ridge_predict > 0.5, 1, 0)

# plot confusion matrix for ridge regression out of sample
generate_cm(y_test, ridge_y_hat, "Confusion Matrix for Ridge (Full Model)")
## # A tibble: 4 × 4
## # Groups:   Predicted [2]
##   True  Predicted  Freq Predicted_pct
##   <fct> <fct>     <int>         <dbl>
## 1 0     0          2357        0.903 
## 2 1     0           253        0.0969
## 3 0     1           108        0.111 
## 4 1     1           869        0.889

# compute roc object for full ridge model
ridge_roc_full <- roc(response = y_test, predictor = ridge_predict, quiet = TRUE)
ridge_auc_full <- round(auc(ridge_roc_full), 4)
cat("AUC for the Ridge (full) model:", ridge_auc_full, "\n")
## AUC for the Ridge (full) model: 0.9464

6.3.1 Ridge Sensitivity Analysis - Subset

# run ridge regression on subset predictors
ridge_subset <- cv.glmnet(x = X_train_subset, y = y_train, alpha = 0, family = "binomial", type.measure = "class")

ridge_lambda_min_subset <- ridge_subset$lambda.min
cat("Optimal lambda for ridge (subset):", ridge_lambda_min_subset, "\n")
## Optimal lambda for ridge (subset): 0.1558983
# plot misclassification error against log lambda
plot(ridge_subset)
abline(v = log(ridge_lambda_min_subset), col = "red", lwd = 2, lty = 2)
title(main = "Cross-Validation Misclassification Error (Ridge - Subset)",
      sub = "Optimal Lambda Highlighted in Red",
      xlab = "Log(Lambda)", ylab = "Misclassification Error")

# plot coefficients
plot(ridge_subset$glmnet.fit, xvar = "lambda", label = TRUE)
title(main = "Coefficient Shrinkage Path for Ridge (Subset)",
      xlab = "Log(Lambda)", ylab = "Coefficients")

# get fitted probabilities
ridge_predict_subset <- as.numeric(predict(ridge_subset, s = ridge_lambda_min_subset, X_test_subset, type = "response"))

# add variable for predicted classes
ridge_y_hat_subset <- ifelse(ridge_predict_subset > 0.5, 1, 0)

# plot confusion matrix for ridge subset regression out of sample
generate_cm(y_test, ridge_y_hat_subset, "Confusion Matrix for Ridge (Subset Model)")
## # A tibble: 4 × 4
## # Groups:   Predicted [2]
##   True  Predicted  Freq Predicted_pct
##   <fct> <fct>     <int>         <dbl>
## 1 0     0          2333         0.743
## 2 1     0           807         0.257
## 3 0     1           132         0.295
## 4 1     1           315         0.705

# compute roc object for subset ridge model
ridge_roc_subset <- roc(response = y_test, predictor = ridge_predict_subset, quiet = TRUE)
ridge_auc_subset <- round(auc(ridge_roc_subset), 4)
cat("AUC for the Ridge (subset) model:", ridge_auc_subset, "\n")
## AUC for the Ridge (subset) model: 0.7681

6.4 Tuned Elastic Net

Now we implement elastic net regression, choosing to tune both \(\lambda\) and \(\alpha\) [insert justification]. For this, we use our validation data, as mentioned.

We tune elastic net in the subset of regressors [just subset here i dont think there is need to do full because then we just have to split more, and we are only doing subset for random forest anyway, takes too long]:

# Define alpha values for grid search
alpha_values <- seq(0, 1, by = 0.1) 

# Initialize storage for results
results <- data.frame(alpha = alpha_values, auc = NA, optimal_lambda = NA)

# loop grid search
for (i in 1:nrow(results)) {
  alpha_i <- results$alpha[i]
  
  # train en model with cross-validation for the given alpha
  en_opt <- cv.glmnet(
    x = X_train_subset_final,
    y = y_train_final,
    alpha = alpha_i,
    family = "binomial",
    type.measure = "class"
  )
  
  # get the best lambda for the current alpha
  en_opt_lambda_min <- en_opt$lambda.min
  
  # get fitted probabilities on the validation set using the best lambda and current alpha
  en_opt_predict_val <- as.numeric(predict(en_opt, s = en_opt_lambda_min, newx = X_val_subset, type = "response"))
  
  # compute roc and aucs of trained model in validation data
  en_opt_roc_val <- roc(response = as.numeric(y_val), predictor = en_opt_predict_val, quiet = TRUE)
  en_opt_auc_val <- auc(en_opt_roc_val)
  
  # add to storage vector
  results$auc[i] <- en_opt_auc_val
  results$optimal_lambda[i] <- en_opt_lambda_min
}

# identify the best alpha
best_params <- results[which.max(results$auc), ]
cat("Optimal alpha:", best_params$alpha, "\n")
## Optimal alpha: 0.2
cat("Optimal lambda:", best_params$optimal_lambda, "\n")
## Optimal lambda: 0.0355349
cat("Best AUC:", best_params$auc, "\n")
## Best AUC: 0.7316938
# train the tuned elastic net model with best hyperparameters
en_opt <- glmnet(
  x = X_train_subset_final,
  y = y_train_final,
  alpha = best_params$alpha,
  lambda = best_params$optimal_lambda,
  family = "binomial"
)

# get fitted probabilities using trained model on test data
en_opt_predict <- as.numeric(predict(en_opt, s = best_params$optimal_lambda, newx = X_test_subset, type = "response"))

# add variable for predicted classes
en_opt_y_hat <- ifelse(en_opt_predict > 0.5, 1, 0)

# plot confusion matrix for tuned elastic net regression out of sample
generate_cm(y_test, en_opt_y_hat, "Confusion Matrix for Elastic Net (Subset Model)")
## # A tibble: 4 × 4
## # Groups:   Predicted [2]
##   True  Predicted  Freq Predicted_pct
##   <fct> <fct>     <int>         <dbl>
## 1 0     0          2297         0.747
## 2 1     0           777         0.253
## 3 0     1           168         0.327
## 4 1     1           345         0.673

# compute test ROC and AUC out of sample
en_opt_roc_subset <- roc(response = as.numeric(y_test), predictor = en_opt_predict, quiet = TRUE)
en_opt_auc_subset <- auc(en_opt_roc_subset)
cat("Test AUC for Tuned Elastic Net model:", en_opt_auc_subset, "\n")
## Test AUC for Tuned Elastic Net model: 0.7708992

6.5 Random Forest

Next, we implement the random forest algorithm [justification].

We do this on the subset of the data as [justification].

We tune the maximum number features of the tree first. We want to refine the maximum number of features to simplify the model. We choose the maximum number of features to loop over as \(m = 20\) and \(ntree \in {50, 100, 150}\), as [elaborate here - mainly computational and want to just simplify the model].

# Initialize parameter grid for grid search
ntrees <- seq(50, 150, 50)  # range of trees to tune over
max_features_range <- 1:20  # Range of mtry  (max features) to tune over

# initialize storage for accuracy metrics
results <- expand.grid(ntree = ntrees, mtry = max_features_range)

dim(results)
## [1] 60  2
results$train_acc <- NA
results$test_acc <- NA
results$oob_acc <- NA

# tune
for (i in 1:nrow(results)) {
  ntrees_i <- results$ntree[i]
  mtry_i <- results$mtry[i]

  # train rf model using training data and current iteration of mtry and ntree
  rf <- randomForest(
    x = X_train_subset_final,
    y = factor(y_train_final),
    mtry = mtry_i,
    ntree = ntrees_i
  )

  # use trained model to predict y in validation data
  y_pred_val <- predict(rf, X_val_subset)
 
  # get trained model predictions of y in training data
  y_pred_train <- rf$predicted

  # compute out of bag accuracy measure
  results$oob_acc[i] <- 1 - rf$err.rate[ntrees_i, "OOB"]
 
  # compute training accuracy measure
  results$train_acc[i] <- mean(y_train_final == as.numeric(levels(y_pred_train)[y_pred_train]))
 
  # compute "test" - validation - accuracy
  results$test_acc[i] <- mean(y_val == as.numeric(levels(y_pred_val)[y_pred_val]))
}

# identify the optimal parameters
best_params <- results[which.max(results$test_acc), ]
cat("Optimal ntree:", best_params$ntree, "\n")
## Optimal ntree: 100
cat("Optimal mtry:", best_params$mtry, "\n")
## Optimal mtry: 13
# define accuracy metrics and y-axis range for tuning plot
accuracy_metrics <- results[results$ntree == best_params$ntree, ]
y_range <- range(c(accuracy_metrics$train_acc, accuracy_metrics$test_acc, accuracy_metrics$oob_acc)) + c(-0.01, 0.01)

# plot train, validation, and OOB accuracy for optimal ntree as mtry varies
plot(
  accuracy_metrics$mtry, accuracy_metrics$train_acc, 
  type = 'l', col = 'blue', ylim = y_range,
  main = 'Tuning mtry for Optimal ntree',
  xlab = 'Number of Features (mtry)', ylab = 'Accuracy'
)
lines(accuracy_metrics$mtry, accuracy_metrics$test_acc, col = 'red')
lines(accuracy_metrics$mtry, accuracy_metrics$oob_acc, col = 'green', lty = 2)
legend('bottomright', 
       legend = c('Train Accuracy', 'Validation Accuracy', 'OOB Accuracy'),
       col = c('blue', 'red', 'green'), lty = c(1, 1, 2), bty = 'n')

# given optimal parameters, train using training data to get tuned model
RFTuned <- randomForest(
  x = X_train_subset_final,
  y = factor(y_train_final),
  mtry = best_params$mtry,
  ntree = best_params$ntree
)

# using model trained with optimal parameters, predict with TEST data
RFPred <- predict(RFTuned, newdata = X_test_subset)

# compute associated relevant fitted probabilities
RFProb <- predict(RFTuned, newdata = X_test_subset, type = "prob")

# compute test roc and auc of tuned model
RF_roc <- roc(response = factor(y_test), predictor = RFProb[, 2], levels = c(0, 1), quiet = TRUE)
RF_auc <- round(auc(RF_roc), 4)
cat("Random Forest AUC (Subset):", RF_auc, "\n")
## Random Forest AUC (Subset): 0.7919
# gen confusion matrix
generate_cm(
  true = y_test,
  predicted = RFPred,
  title = "Confusion Matrix for Random Forest (Subset)"
)
## # A tibble: 4 × 4
## # Groups:   Predicted [2]
##   True  Predicted  Freq Predicted_pct
##   <fct> <fct>     <int>         <dbl>
## 1 0     0          2271         0.763
## 2 1     0           706         0.237
## 3 0     1           194         0.318
## 4 1     1           416         0.682

7 Results - Comparison

# function to plot ROC curves for multiple models
plot_multiple_roc <- function(roc_list, labels, colors, lwd_list, lty_list, main_title) {
  # Plot the first model as the base plot
  plot(
    x = 1 - roc_list[[1]]$specificities, 
    y = roc_list[[1]]$sensitivities, 
    type = "l",
    col = colors[1], 
    lwd = lwd_list[1], 
    lty = lty_list[1],
    xlim = c(0, 1), 
    ylim = c(0, 1), 
    main = main_title, 
    xlab = "False Positive Rate", 
    ylab = "True Positive Rate",
    cex.lab = 1.5, 
    cex.main = 1.8, 
    cex.axis = 1.2
  )
  
  # Add the remaining ROC curves
  for (i in 2:length(roc_list)) {
    lines(
      x = 1 - roc_list[[i]]$specificities, 
      y = roc_list[[i]]$sensitivities, 
      col = colors[i], 
      lwd = lwd_list[i], 
      lty = lty_list[i]
    )
  }
  
  # Add legend
  legend(
    "bottomright", 
    legend = labels, 
    col = colors, 
    lwd = lwd_list, 
    lty = lty_list, 
    bty = "n"
  )
}

# set list of input rcs and associated labels, colors etc for arguments
roc_list <- list(logit_roc, lasso_roc_full, ridge_roc_full, RF_roc, ridge_roc_subset, lasso_roc_subset, en_opt_roc_subset)
labels <- c(
  sprintf("LASSO Full (AUC: %.4f)", lasso_auc_full),
  sprintf("Ridge Full (AUC: %.4f)", ridge_auc_full),
  sprintf("Logistic Regression (AUC: %.4f)", logit_auc),
  sprintf("Random Forest Subset (AUC: %.4f)", RF_auc),
  sprintf("Ridge Subset (AUC: %.4f)", ridge_auc_subset),
  sprintf("LASSO Subset (AUC: %.4f)", lasso_auc_subset),
  sprintf("Elastic Net Subset (AUC: %.4f)", en_opt_auc_subset)
)
colors <- c("red", "darkgreen", "blue", "darkorange", "lightgreen", "pink", "purple")
lwd_list <- c(3, 3, 2, 2, 2, 2, 2)
lty_list <- c(2, 3, 1, 1, 3, 2, 1)

#call
plot_multiple_roc(
  roc_list = roc_list,
  labels = labels,
  colors = colors,
  lwd_list = lwd_list,
  lty_list = lty_list,
  main_title = "ROC Curves for All Models"
)

# make and print sorted table of AUCs
auc_table <- data.frame(
  Model = c(
    "Logistic Regression",
    "LASSO Full",
    "LASSO Subset",
    "Ridge Full",
    "Ridge Subset",
    "Elastic Net Subset",
    "Random Forest Subset"
  ),
  AUC = c(
    logit_auc,
    lasso_auc_full,
    lasso_auc_subset,
    ridge_auc_full,
    ridge_auc_subset,
    en_opt_auc_subset,
    RF_auc
  )
)

auc_table <- auc_table %>% 
  arrange(desc(AUC))

print(auc_table)
##                  Model       AUC
## 1           LASSO Full 0.9464000
## 2           Ridge Full 0.9464000
## 3 Random Forest Subset 0.7919000
## 4         LASSO Subset 0.7754000
## 5  Logistic Regression 0.7751000
## 6   Elastic Net Subset 0.7708992
## 7         Ridge Subset 0.7681000

Insert interpretation on relative out of sample predictive performance

7.1 Variable Importance

# Function to compute and plot variable importance
compute_variable_importance <- function(coefficients, model_name, top_n = 20) {
  importance_df <- data.frame(
    Variable = names(coefficients),
    Scaled_Importance = abs(coefficients) / max(abs(coefficients), na.rm = TRUE)
  ) %>%
    filter(Scaled_Importance > 0) %>%
    arrange(desc(Scaled_Importance)) %>%
    slice(1:top_n)
  
  ggplot(importance_df, aes(x = reorder(Variable, Scaled_Importance), y = Scaled_Importance)) +
    geom_bar(stat = "identity", fill = "blue", alpha = 0.7) +
    coord_flip() +
    labs(
      title = paste("Top", top_n, "Var Imp for", model_name),
      x = "Variable",
      y = "Scaled Importance"
    ) +
    theme_minimal()
}

# 1. Logistic Regression
logit_coeff <- coef(logit) 
logit_coeff <- logit_coeff[-1] 
logit_plot <- compute_variable_importance(logit_coeff, "Logistic Regression")
print(logit_plot)

# 2. LASSO Subset
lasso_subset_coeff <- as.numeric(coef(lasso_subset, s = lasso_lambda_min_subset)) 
names(lasso_subset_coeff) <- rownames(coef(lasso_subset, s = lasso_lambda_min_subset))
lasso_subset_coeff <- lasso_subset_coeff[-1] 
lasso_plot <- compute_variable_importance(lasso_subset_coeff, "LASSO Subset")
print(lasso_plot)

# 3. Ridge Subset
ridge_subset_coeff <- as.numeric(coef(ridge_subset, s = ridge_lambda_min_subset)) 
names(ridge_subset_coeff) <- rownames(coef(ridge_subset, s = ridge_lambda_min_subset))
ridge_subset_coeff <- ridge_subset_coeff[-1] 
ridge_plot <- compute_variable_importance(ridge_subset_coeff, "Ridge Subset")
print(ridge_plot)

# Extract Random Forest importance
rf_importance_subset <- data.frame(
  Variable = rownames(RFTuned$importance),
  Scaled_Importance = RFTuned$importance[, "MeanDecreaseGini"] / max(RFTuned$importance[, "MeanDecreaseGini"], na.rm = TRUE)
)
rf_importance_subset <- rf_importance_subset %>%
  arrange(desc(Scaled_Importance))

# plot rf variable importance - top 20
ggplot(rf_importance_subset %>% arrange(desc(Scaled_Importance)) %>% slice(1:20),
       aes(x = reorder(Variable, Scaled_Importance), y = Scaled_Importance)) +
  geom_bar(stat = "identity", fill = "blue", alpha = 0.7) +
  coord_flip() +
  labs(title = "Top 20 Var Imp for RF",
       x = "Variable",
       y = "Scaled Importance") +
  theme_minimal()

Insert interpretation on importance

8 Conclusion

summary, racial component?

note limitations of our analysis, possible extensions


  1. In the raw data, the flag’s entries are actually "Y" and "N", but we clean this.↩︎

  2. Run time increases significantly if we tune elastic net and random forest for the full set of predictors and the subset of predictors.↩︎